matthews correlation coefficient
Appendix for the Paper: " SizeShiftReg: a Regularization Method for Improving Size-Generalization in Graph Neural Networks "
This Appendix contains the following additional material. In Section A we report more statistics regarding the considered datasets. E we show the overhead (in terms of training time) incurred by our method. Table 1: Dataset statistics, this table is taken from Y ehudai et al. [13], Bevilacqua et al. [1]. In Figure 1 we show the continuation of Figure 2 (b) from the main paper, including plots on all datasets.
Deep Learning for Glioblastoma Morpho-pathological Features Identification: A BraTS-Pathology Challenge Solution
Zhang, Juexin, Weng, Ying, Chen, Ke
Glioblastoma, a highly aggressive brain tumor with diverse molecular and pathological features, poses a diagnostic challenge due to its heterogeneity. Accurate diagnosis and assessment of this heterogeneity are essential for choosing the right treatment and improving patient outcomes. Traditional methods rely on identifying specific features in tissue samples, but deep learning offers a promising approach for improved glioblastoma diagnosis. In this paper, we present our approach to the BraTS-Path Challenge 2024. We leverage a pre-trained model and fine-tune it on the BraTS-Path training dataset. Our model demonstrates poor performance on the challenging BraTS-Path validation set, as rigorously assessed by the Synapse online platform. The model achieves an accuracy of 0.392229, a recall of 0.392229, and a F1-score of 0.392229, indicating a consistent ability to correctly identify instances under the target condition. Notably, our model exhibits perfect specificity of 0.898704, showing an exceptional capacity to correctly classify negative cases. Moreover, a Matthews Correlation Coefficient (MCC) of 0.255267 is calculated, to signify a limited positive correlation between predicted and actual values and highlight our model's overall predictive power. Our solution also achieves the second place during the testing phase.
Interactive Classification Metrics: A graphical application to build robust intuition for classification model evaluation
Brown, David H., Chicco, Davide
Machine learning continues to grow in popularity in academia, in industry, and is increasingly used in other fields. However, most of the common metrics used to evaluate even simple binary classification models have shortcomings that are neither immediately obvious nor consistently taught to practitioners. Here we present Interactive Classification Metrics (ICM), an application to visualize and explore the relationships between different evaluation metrics. The user changes the distribution statistics and explores corresponding changes across a suite of evaluation metrics. The interactive, graphical nature of this tool emphasizes the tradeoffs of each metric without the overhead of data wrangling and model training. The goals of this application are: (1) to aid practitioners in the ever-expanding machine learning field to choose the most appropriate evaluation metrics for their classification problem; (2) to promote careful attention to interpretation that is required even in the simplest scenarios like binary classification. Our application is publicly available for free under the MIT license as a Python package on PyPI at https://pypi.org/project/interactive-classification-metrics and on GitHub at https://github.com/davhbrown/interactive_classification_metrics.
Empirical Analysis of Efficient Fine-Tuning Methods for Large Pre-Trained Language Models
Doering, Nigel, Gorlla, Cyril, Tuttle, Trevor, Vijay, Adhvaith
Fine-tuning large pre-trained language models for downstream tasks remains a critical challenge in natural language processing. This paper presents an empirical analysis comparing two efficient fine-tuning methods - BitFit and adapter modules - to standard full model fine-tuning. Experiments conducted on GLUE benchmark datasets (MRPC, COLA, STS-B) reveal several key insights. The BitFit approach, which trains only bias terms and task heads, matches full fine-tuning performance across varying amounts of training data and time constraints. It demonstrates remarkable stability even with only 30\% of data, outperforming full fine-tuning at intermediate data levels. Adapter modules exhibit high variability, with inconsistent gains over default models. The findings indicate BitFit offers an attractive balance between performance and parameter efficiency. Our work provides valuable perspectives on model tuning, emphasizing robustness and highlighting BitFit as a promising alternative for resource-constrained or streaming task settings. The analysis offers actionable guidelines for efficient adaptation of large pre-trained models, while illustrating open challenges in stabilizing techniques like adapter modules.
Algorithms for automatic intents extraction and utterances classification for goal-oriented dialogue systems
Legashev, Leonid, Shukhman, Alexander, Zhigalov, Arthur
Modern machine learning techniques in the natural language processing domain can be used to automatically generate scripts for goal-oriented dialogue systems. The current article presents a general framework for studying the automatic generation of scripts for goal-oriented dialogue systems. A method for preprocessing dialog data sets in JSON format is described. A comparison is made of two methods for extracting user intent based on BERTopic and latent Dirichlet allocation. A comparison has been made of two implemented algorithms for classifying statements of users of a goal-oriented dialogue system based on logistic regression and BERT transformer models. The BERT transformer approach using the bert-base-uncased model showed better results for the three metrics Precision (0.80), F1-score (0.78) and Matthews correlation coefficient (0.74) in comparison with other methods.
Does the evaluation stand up to evaluation? A first-principle approach to the evaluation of classifiers
Dyrland, K., Lundervold, A. S., Mana, P. G. L. Porta
How can one meaningfully make a measurement, if the meter does not conform to any standard and its scale expands or shrinks depending on what is measured? In the present work it is argued that current evaluation practices for machine-learning classifiers are affected by this kind of problem, leading to negative consequences when classifiers are put to real use; consequences that could have been avoided. It is proposed that evaluation be grounded on Decision Theory, and the implications of such foundation are explored. The main result is that every evaluation metric must be a linear combination of confusion-matrix elements, with coefficients - "utilities" - that depend on the specific classification problem. For binary classification, the space of such possible metrics is effectively two-dimensional. It is shown that popular metrics such as precision, balanced accuracy, Matthews Correlation Coefficient, Fowlkes-Mallows index, F1-measure, and Area Under the Curve are never optimal: they always give rise to an in-principle avoidable fraction of incorrect evaluations. This fraction is even larger than would be caused by the use of a decision-theoretic metric with moderately wrong coefficients.
Goodness of Fit Metrics for Multi-class Predictor
The multi-class prediction had gained popularity over recent years. Thus measuring fit goodness becomes a cardinal question that researchers often have to deal with. Several metrics are commonly used for this task. However, when one has to decide about the right measurement, he must consider that different use-cases impose different constraints that govern this decision. A leading constraint at least in \emph{real world} multi-class problems is imbalanced data: Multi categorical problems hardly provide symmetrical data. Hence, when we observe common KPIs (key performance indicators), e.g., Precision-Sensitivity or Accuracy, one can seldom interpret the obtained numbers into the model's actual needs. We suggest generalizing Matthew's correlation coefficient into multi-dimensions. This generalization is based on a geometrical interpretation of the generalized confusion matrix.
Evaluation metrics: leave your comfort zone and try MCC and Brier Score
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Machine learning instead is the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Data scientist around the world apply machine learning in order to build models able to predict future events, cluster people/objects into similar groups and also identify unexpected anomalies. Every enthusiastic data scientist knows that the most exciting part of machine learning is to choose the coolest algorithm capable of solve the problem of interest (Supervised or Unsupervised).
Extracting Angina Symptoms from Clinical Notes Using Pre-Trained Transformer Architectures
Eisman, Aaron S., Shah, Nishant R., Eickhoff, Carsten, Zerveas, George, Chen, Elizabeth S., Wu, Wen-Chih, Sarkar, Indra Neil
Anginal symptoms can connote increased cardiac risk and a need for change in cardiovascular management. This study evaluated the potential to extract these symptoms from physician notes using the Bidirectional Encoder from Transformers language model fine-tuned on a domain-specific corpus. The history of present illness section of 459 expert annotated primary care physician notes from consecutive patients referred for cardiac testing without known atherosclerotic cardiovascular disease were included. Notes were annotated for positive and negative mentions of chest pain and shortness of breath characterization. The results demonstrate high sensitivity and specificity for the detection of chest pain or discomfort, substernal chest pain, shortness of breath, and dyspnea on exertion. Small sample size limited extracting factors related to provocation and palliation of chest pain. This study provides a promising starting point for the natural language processing of physician notes to characterize clinically actionable anginal symptoms. Introduction Angina pectoris is a constellation of symptoms that portends inadequate oxygenation of cardiac muscle due to either a decrease in coronary blood supply, an increase in myocardial oxygen demand, or both.